Our Friend Least Squares Regression
Fitting A Model
The easiest model.
Call:
lm(formula = y ~ x, data = df)
Coefficients:
(Intercept) x
17.280 2.625
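A minimal sketch of how a fit like the one above can be produced. The original `df` is not shown, so the data here are a simulated, hypothetical stand-in and will not reproduce the printed coefficients.

```r
# Hypothetical stand-in data: simulated, NOT the df used on this slide
set.seed(42)
df <- data.frame(x = 1:10)
df$y <- 17 + 2.5 * df$x + rnorm(10, sd = 5)

# Fit y as a linear function of x by ordinary least squares
fit <- lm(y ~ x, data = df)

fit        # printing the object shows the call and the coefficients
coef(fit)  # the intercept and slope as a named numeric vector
```

With 10 observations and 2 estimated coefficients, the fit has 8 residual degrees of freedom, the same as in the summary output on the next slide.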
Model Summaries
Summaries of the overall model components.
Call:
lm(formula = y ~ x, data = df)
Residuals:
Min 1Q Median 3Q Max
-7.9836 -4.0182 -0.8709 5.3064 6.9909
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.280 4.002 4.318 0.00255 **
x 2.626 0.645 4.070 0.00358 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.859 on 8 degrees of freedom
Multiple R-squared: 0.6744, Adjusted R-squared: 0.6337
F-statistic: 16.57 on 1 and 8 DF, p-value: 0.003581
Components within the Summary Object
Like cor.test(), the object returned from lm() and its summary object have several internal components that you may use.
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
[1] "call" "terms" "residuals" "coefficients"
[5] "aliased" "sigma" "df" "r.squared"
[9] "adj.r.squared" "fstatistic" "cov.unscaled"
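The two listings above are what `names()` returns for the fitted model and for its summary. A short sketch, again on simulated stand-in data (an assumption, since the original `df` is not shown), of listing the components and pulling a few of them out:

```r
# Hypothetical stand-in data: simulated, NOT the df used on this slide
set.seed(42)
df <- data.frame(x = 1:10)
df$y <- 17 + 2.5 * df$x + rnorm(10, sd = 5)

fit <- lm(y ~ x, data = df)
s   <- summary(fit)

names(fit)  # components of the lm object
names(s)    # components of the summary object

# Individual components are reached with `$`
fit$coefficients  # same values as coef(fit)
s$r.squared       # Multiple R-squared
s$sigma           # residual standard error
s$fstatistic      # value, numdf, dendf
```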
The probability can be found by taking the value of the F-statistic and asking the F-distribution for the upper-tail probability associated with that value, given the degrees of freedom for both the model and the residuals.
value numdf dendf
16.56838 1.00000 8.00000
The Summary Object
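When you print out the summary( fit ) object, it uses the fstatistic component to compute a P-value. If we need that P-value directly, we can compute it ourselves from the F-distribution, using the value of the statistic and its two degrees of freedom.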
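A sketch of that calculation, using the three `fstatistic` numbers printed above. The helper name `model_p_value` is hypothetical, introduced here only for illustration.

```r
# F-statistic components as printed on the slide above
f <- c(value = 16.56838, numdf = 1, dendf = 8)

# Upper-tail probability of the F-distribution
p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
p

# The same steps wrapped in a helper (hypothetical name) for any lm fit
model_p_value <- function(fit) {
  f <- summary(fit)$fstatistic
  unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
}
```

Using `lower.tail = FALSE` rather than `1 - pf(...)` is a deliberate choice: for very strong relationships the subtraction can round to exactly zero, while the upper-tail call keeps the small probability.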
What Makes One Model Better
There are two parameters that we have already looked at that may help. These are:
- The \(R^2\) of the model.
- The P-value of the model.
New Data Set
Let’s start by looking at some air quality data that are built into R as an example data set.
Ozone Solar.R Wind Temp
Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
NA's :37 NA's :7
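A summary like the one above can be produced from the built-in `airquality` data frame. The sketch below assumes that only the four columns shown are of interest; the intermediate name `aq` is just for illustration.

```r
# airquality ships with R in the datasets package
aq <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

summary(aq)         # five-number summary, mean, and NA count per column
colSums(is.na(aq))  # missing values per column
```

The missing values matter later: `lm()` drops incomplete rows by default, so models with different predictors may be fit to different numbers of observations.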
Base Models - What Influences Ozone
Individually, we can estimate a set of first-order models.
\[
y = \beta_0 + \beta_1 x_1 + \epsilon
\]
Model parameters predicting mean ozone (in parts per billion) measured in New York from 1 May 1973 to 30 September 1973, as predicted by temperature, wind speed, and solar radiation.

| Model | \(R^2\) | P |
|---|---|---|
| Ozone ~ Solar | 0.121 | 1.79e-04 |
| Ozone ~ Temp | 0.488 | 0.00e+00 |
| Ozone ~ Wind | 0.362 | 9.27e-13 |
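One way such a table could be assembled is sketched below. It assumes the models were fit to `airquality` with the default handling of missing values, and that "Solar" refers to the `Solar.R` column; the helper name `summarize_fit` is hypothetical.

```r
# One formula per first-order model, labelled as in the table
models <- list(
  "Ozone ~ Solar" = Ozone ~ Solar.R,
  "Ozone ~ Temp"  = Ozone ~ Temp,
  "Ozone ~ Wind"  = Ozone ~ Wind
)

# Fit one model and return its R-squared and model P-value
summarize_fit <- function(f) {
  s  <- summary(lm(f, data = airquality))
  fs <- s$fstatistic
  c(R2 = s$r.squared,
    P  = unname(pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)))
}

# One row per model, columns R2 and P
results <- t(sapply(models, summarize_fit))
signif(results, 3)
```

Because this sketch uses the upper tail of the F-distribution directly, very small P-values are reported as tiny positive numbers rather than as zero.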
More Complicated Models
Multiple Regression Model - Including more than one predictor.
\(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\)
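In R's formula syntax, each additional predictor is joined with `+`. A brief sketch using the built-in `airquality` data (the name `fit2` is only for illustration):

```r
# Two predictors: each term joined by `+` gets its own slope
fit2 <- lm(Ozone ~ Temp + Wind, data = airquality)

coef(fit2)               # beta_0, beta_1 (Temp), beta_2 (Wind)
summary(fit2)$r.squared  # proportion of variation explained
```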
Model parameters predicting mean ozone (in parts per billion) measured in New York from 1 May 1973 to 30 September 1973.

| Model | \(R^2\) | P |
|---|---|---|
| Ozone ~ Solar | 0.121 | 1.79e-04 |
| Ozone ~ Temp | 0.488 | 0.00e+00 |
| Ozone ~ Wind | 0.362 | 9.27e-13 |
| Ozone ~ Temp + Wind | 0.569 | 0.00e+00 |
| Ozone ~ Temp + Solar | 0.510 | 0.00e+00 |
| Ozone ~ Wind + Solar | 0.449 | 9.99e-15 |
For Completeness
How about all the predictors? \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\)
| Model | \(R^2\) | P |
|---|---|---|
| Ozone ~ Solar | 1.21e-01 | 1.79e-04 |
| Ozone ~ Temp | 4.88e-01 | 0.00e+00 |
| Ozone ~ Wind | 3.62e-01 | 9.27e-13 |
| Ozone ~ Temp + Wind | 5.69e-01 | 0.00e+00 |
| Ozone ~ Temp + Solar | 5.10e-01 | 0.00e+00 |
| Ozone ~ Wind + Solar | 4.49e-01 | 9.99e-15 |
| Ozone ~ Temp + Wind + Solar | 6.06e-01 | 0.00e+00 |
\(R^2\) Inflation
Any variable added to a model will account for some Sums of Squares (even if only a small amount), so adding variables may artificially inflate the Model Sums of Squares.
Example:
What happens if I add just random data to the regression models? How does \(R^2\) change?
Random Data Effects
Original data in models.
| Model | \(R^2\) |
|---|---|
| Ozone ~ Temp | 0.4877 |
| Ozone ~ Wind | 0.3619 |
| Ozone ~ Solar | 0.1213 |
| Ozone ~ Temp + Wind | 0.5687 |
| Ozone ~ Temp + Solar | 0.5103 |
| Ozone ~ Wind + Solar | 0.4495 |
| Ozone ~ Temp + Wind + Solar | 0.6059 |
Original full model + X random variables
| Model | \(R^2\) |
|---|---|
| Ozone ~ Temp + Wind + Solar + 1 Random Variable | 0.6091 |
| Ozone ~ Temp + Wind + Solar + 2 Random Variables | 0.6176 |
| Ozone ~ Temp + Wind + Solar + 3 Random Variables | 0.6317 |
| Ozone ~ Temp + Wind + Solar + 4 Random Variables | 0.6444 |
| Ozone ~ Temp + Wind + Solar + 5 Random Variables | 0.6449 |
| Ozone ~ Temp + Wind + Solar + 6 Random Variables | 0.6458 |
| Ozone ~ Temp + Wind + Solar + 7 Random Variables | 0.6460 |
| Ozone ~ Temp + Wind + Solar + 8 Random Variables | 0.6488 |
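A sketch of how this experiment might be run. The random predictors here are standard normal draws with an arbitrary seed (both assumptions), so the exact \(R^2\) values will differ from those in the table; the pattern of never decreasing is what matters.

```r
set.seed(1)  # arbitrary seed so the run is repeatable

# Keep complete rows so every model is fit to the same observations
aq <- na.omit(airquality[, c("Ozone", "Temp", "Wind", "Solar.R")])

r2 <- numeric(9)
r2[1] <- summary(lm(Ozone ~ ., data = aq))$r.squared  # full model, no noise

for (i in 1:8) {
  # Add one more column of pure noise, then refit with all columns
  aq[[paste0("rnd", i)]] <- rnorm(nrow(aq))
  r2[i + 1] <- summary(lm(Ozone ~ ., data = aq))$r.squared
}

names(r2) <- paste(0:8, "random")
round(r2, 4)
```

Each model contains the previous one as a special case, so least squares can never fit worse after a column is added, even when that column is noise.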
Perfect - My Models RULE
I can just add random variables to my model and always get an awesome fit!
Not so fast, Beavis!
Model Comparisons
The Akaike Information Criterion (AIC) is a measurement that allows us to compare models while penalizing the addition of new parameters.
\[AIC = -2 \ln L + 2p\]
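where \(L\) is the maximized likelihood of the model and \(p\) is the number of estimated parameters. The criterion here is to find the models with the lowest AIC values (the lowest value itself, not the smallest magnitude, since AIC can be negative).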
To compare, we evaluate the differences in AIC for alternative models.
\(\delta AIC = AIC - min( AIC )\)
Interpretation
- \(0 \le \delta AIC\; < 2.0\): Models for consideration (the best model always has \(\delta AIC = 0\)).
- \(2.0 \le \delta AIC\; < 5.0\): May be of interest.
- \(\delta AIC \ge 5.0\): Not to be considered.
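A sketch of the comparison for the seven ozone models, using R's built-in `AIC()`. It assumes all models should be fit to the same complete rows of `airquality`, since AIC values are only comparable when the models describe the same observations.

```r
# Same observations for every model: drop rows with any missing value
aq <- na.omit(airquality[, c("Ozone", "Temp", "Wind", "Solar.R")])

models <- list(
  "Ozone ~ Temp"                = Ozone ~ Temp,
  "Ozone ~ Wind"                = Ozone ~ Wind,
  "Ozone ~ Solar"               = Ozone ~ Solar.R,
  "Ozone ~ Temp + Wind"         = Ozone ~ Temp + Wind,
  "Ozone ~ Temp + Solar"        = Ozone ~ Temp + Solar.R,
  "Ozone ~ Wind + Solar"        = Ozone ~ Wind + Solar.R,
  "Ozone ~ Temp + Wind + Solar" = Ozone ~ Temp + Wind + Solar.R
)

fits  <- lapply(models, lm, data = aq)
aic   <- sapply(fits, AIC)   # AIC for each fitted model
delta <- aic - min(aic)      # difference from the best model

# Table sorted from best (delta = 0) to worst
data.frame(AIC = round(aic, 1), delta = round(delta, 1))[order(delta), ]
```

Note that `AIC()` also counts the residual variance as an estimated parameter. That shifts every model's value by the same amount, so the differences are unaffected.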
Questions
If you have any questions, please feel free to either post them as an “Issue” on your copy of this GitHub Repository, post to the Canvas discussion board for the class, or drop me an email.